# numeric
# 1331
# [1] "Are all rows complete?: TRUE"
# [1] "Are there any NAs?: FALSE"
# [1] "Are any values negative?: FALSE"
If \(d > 100\), we reduce the number of columns using CUR decomposition. If \(n > 10^3\), we reduce the number of rows using CUR decomposition.
Other methods are available.
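To illustrate the size threshold on columns, here is a minimal Python sketch that keeps a subset of the original columns once \(d\) exceeds a limit. It ranks columns by squared Euclidean norm, which is only a simplified stand-in for the column-selection step of CUR and not the method used to produce this report. The names `select_columns` and `reduce_columns` and the parameters `max_cols` and `keep` are hypothetical.

```python
def select_columns(rows, keep):
    """Return indices of the `keep` columns with the largest squared norm."""
    n_cols = len(rows[0])
    # Squared Euclidean norm of each column.
    norms = [sum(row[j] ** 2 for row in rows) for j in range(n_cols)]
    # Rank columns by norm (largest first), then restore the original order.
    ranked = sorted(range(n_cols), key=lambda j: norms[j], reverse=True)
    return sorted(ranked[:keep])


def reduce_columns(rows, max_cols=100, keep=100):
    """Keep a column subset only when the column count exceeds max_cols."""
    if len(rows[0]) <= max_cols:
        return rows  # small enough: leave the data untouched
    idx = select_columns(rows, keep)
    return [[row[j] for j in idx] for row in rows]


# Toy data with 4 columns; the threshold is lowered to 3 for illustration.
data = [[1.0, 0.1, 5.0, 0.2],
        [2.0, 0.1, 4.0, 0.3]]
small = reduce_columns(data, max_cols=3, keep=2)
print(small)
```

Because the selected columns are actual columns of the data, they keep their original names and meaning, which is the main appeal of CUR-style reductions.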
The heatmap below is a representation of the data with values shown in color according to magnitude. Hover over a cell to see the column name.
The violin plot combines a kernel density estimate with a boxplot for a more detailed visualization. A jittered scatter plot of the points is overlaid; the jittering helps reduce the effects of overplotting.
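The two ingredients of the violin plot can be sketched in a few lines of Python. This is an illustration only, assuming a Gaussian kernel with a hand-picked bandwidth and uniform jitter; the report's plotting code may choose both differently, and the names `gaussian_kde` and `jitter` are hypothetical.

```python
import math
import random


def gaussian_kde(values, bandwidth):
    """Return a function estimating the density of `values` at a point x."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation.
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def jitter(n_points, width, seed=0):
    """Small random horizontal offsets, uniform in [-width, width]."""
    rng = random.Random(seed)
    return [rng.uniform(-width, width) for _ in range(n_points)]


values = [1.0, 1.1, 0.9, 1.0, 3.0]
density = gaussian_kde(values, bandwidth=0.3)  # bandwidth is an assumption
offsets = jitter(len(values), width=0.1)

# The density is higher where the points cluster than in the gap between them.
print(density(1.0) > density(2.0))
```

The density curve gives the outline of the violin, while the offsets spread identical or near-identical values horizontally so that they do not hide one another.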
The correlation between two random variables measures a specific type of dependence: the degree to which a linear relationship exists between the two variables. A value of 1 corresponds to a direct linear relationship, 0 corresponds to no linear relationship, and -1 corresponds to an inverse linear relationship.
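For concreteness, the Pearson correlation coefficient can be computed as below. This is a plain Python sketch for illustration, not the code used to build the report; the function name `pearson` is hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of the two variables around their means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Variation of each variable on its own.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0]
direct = pearson(x, [2.0, 4.0, 6.0, 8.0])    # y rises in step with x
inverse = pearson(x, [8.0, 6.0, 4.0, 2.0])   # y falls in step with x
print(direct, inverse)
```

Note that a correlation near 0 rules out only a linear relationship; two variables can still be strongly dependent in a nonlinear way.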
An outlier is a datapoint that lives relatively far away from the bulk of other observations. Outliers can have unwanted effects on data analysis and therefore should be considered carefully.
We use the built-in method from the randomForest package in R.
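The randomForest approach is not reproduced here. As a much simpler illustration of what "far away from the bulk" can mean, the sketch below flags outliers in a single variable with the interquartile-range (IQR) rule, a different and swapped-in technique. The function name `iqr_outliers` and the conventional multiplier 1.5 are assumptions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [10, 11, 12, 11, 10, 12, 11, 50]
flagged = iqr_outliers(data)
print(flagged)
```

Unlike this one-dimensional rule, a proximity-based method can flag points that look ordinary in every single column but are unusual in combination.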
The variance measures how spread out the data are around their mean. Cumulative variance reports, as a percentage of the total, how much variation is accounted for as each dimension is added.
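A minimal Python sketch of both quantities follows, assuming the sample variance (divisor \(n - 1\)) and columns taken in their original order. The function names `column_variances` and `cumulative_percent` are hypothetical.

```python
def column_variances(rows):
    """Sample variance of each column (divisor n - 1)."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_cols)]
    return [
        sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        for j in range(n_cols)
    ]


def cumulative_percent(variances):
    """Running share of the total variance, in percent."""
    total = sum(variances)
    running = 0.0
    shares = []
    for v in variances:
        running += v
        shares.append(100.0 * running / total)
    return shares


rows = [[1.0, 10.0],
        [2.0, 10.0],
        [3.0, 13.0]]
variances = column_variances(rows)
shares = cumulative_percent(variances)
print(variances, shares)
```

The last entry of the cumulative series is always 100, since all dimensions together account for the full variation.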
In this implementation we use principal components analysis to select linear combinations of the features that explain the dataset best in low dimensions.
The plot below shows how much variance is explained when adding columns one at a time. The elbows denote good “cut-off” points for dimension selection.
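One simple way to locate an elbow programmatically is to look for the largest drop in explained variance between consecutive components. The Python sketch below is one heuristic among several, not necessarily how the plot's elbows are determined; the function name `find_elbow` and the example variances are made up for illustration.

```python
def find_elbow(explained):
    """Number of components to keep before the largest drop in variance.

    `explained` holds the variance explained by each component,
    ordered from the first component to the last.
    """
    # Drop in contribution between each component and the next one.
    drops = [explained[i] - explained[i + 1] for i in range(len(explained) - 1)]
    # Keep everything up to and including the component before the biggest drop.
    return drops.index(max(drops)) + 1


# Illustrative values only: two strong components, then a flat tail.
explained = [5.0, 3.0, 0.5, 0.3, 0.2]
n_keep = find_elbow(explained)
print(n_keep)
```

On a cumulative-variance curve this corresponds to the point where the curve bends most sharply from steep to flat.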
A pairs plot is a popular way of plotting high-dimensional data.
Every pair of dimensions is plotted, showing the projection of the data onto those two dimensions.
For readability, a maximum of 8 dimensions is plotted.
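The set of panels in such a plot can be enumerated as in the Python sketch below. It assumes the first 8 columns are the ones kept, which the report does not state; the function name `pair_panels` and the column names are hypothetical.

```python
from itertools import combinations


def pair_panels(column_names, max_dims=8):
    """List every unordered pair among the first `max_dims` columns."""
    shown = column_names[:max_dims]  # assumption: keep the leading columns
    return list(combinations(shown, 2))


names = [f"V{i}" for i in range(1, 11)]  # 10 hypothetical columns
panels = pair_panels(names)
print(len(panels))
```

With 8 dimensions there are 28 distinct pairs, and the count grows quadratically with the number of dimensions, which is why a cap keeps the plot readable.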